ALI EHSANI, (matricula : 965407), MSc. students in Molecular Biotechnology and Bioinformatics(MBB)
“Single Cell” RNA-Seq analysis

Here is the report of transcriptomics project result which includes the Single Cell RNA-Seq analysis. The samples I worked with, is obtained from Left Ventricle of Heart tissue Mus musculus (Mm); The number of cells (with at least 1000 counted reads) in this project was 2730; Number of expressed genes was 24,878.

Single cells were clustered based on gene expression.

General description of this project, I have to deal with single-cell RNA-Seq data that come from the PanglaoDB. I worked on a dataset based on the 10X platform. I followed the steps of the Seurat vignette in this project.

* dplyr is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges. * The goal of patchwork is to make it ridiculously simple to combine separate ggplots into the same graphic.

library(Seurat)
## Loading required package: SeuratObject
## Loading required package: sp
## The legacy packages maptools, rgdal, and rgeos, underpinning this package
## will retire shortly. Please refer to R-spatial evolution reports on
## https://r-spatial.org/r/2023/05/15/evolution4.html for details.
## This package is now running under evolution status 0
## 
## Attaching package: 'SeuratObject'
## The following objects are masked from 'package:base':
## 
##     intersect, saveRDS
## Loading Seurat v5 beta version 
## To maintain compatibility with previous workflows, new Seurat objects will use the previous object structure by default
## To use new Seurat v5 assays: Please run: options(Seurat.object.assay.version = 'v5')
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(patchwork) 

For loading datasets we can retrieve the counts as a “Rdata” object, or as a compressed count matrix. I preferred first way.

load('SRA637291_SRS2749416.sparse.RData')
#OR
#Read in the expression matrix
#exp.mat <- read.table(file = "SRA637291_SRS2749416.mat.gz", header = T, as.is = TRUE, row.names = 1)

Pre-processing raw data:

Each row contains gene symbol with ensembl. By using these commands I extracted ENSEMBL of my data:

Ensembl= gsub(".*_", "", rownames(sm))
Ensembl= gsub("\\..*", "", Ensembl)

Then I keep only the gene names of my data which are in rownames:

rownames(sm) = gsub("\\_E.*", "", rownames(sm))

Here I showed some genes as an example. The dot values in the matrix represent no molecules detected. Most values in an scRNA-seq matrix are 0, Seurat uses a sparse-matrix representation whenever possible. This results in significant memory and speed savings for Drop-seq/inDrop/10x/“many cells few reads data.

sm[c("Gsn", "Col1a2", "Acta1","Lgals3", "Fabp4", "Ckap4"), 1:30]
## 6 x 30 sparse Matrix of class "dgCMatrix"
##   [[ suppressing 30 column names 'AAACCTGAGCAACGGT', 'AAACCTGAGGACATTA', 'AAACCTGAGGGATACC' ... ]]
##                                                                    
## Gsn    . . . . . . 1 . . . 1 . . 1 . 12 . . . . 1 . . . 1 . . . . 2
## Col1a2 . . . . . . . . . 3 . . . . . 15 . . . . . . . . . . . . . 1
## Acta1  . . . . . . . . . . . . . . .  . . . . . . . . . . . . . 1 .
## Lgals3 1 . . . . . . . . . . . . 1 1  1 . . . . . . . . 2 . . . . .
## Fabp4  . . 1 . . . . . . . . . . . .  1 . . . . . 1 . . 2 . 1 1 . 1
## Ckap4  . . . . . . . . . . . . . . .  1 . . . . . . . . . . . . . .

Create our Seurat object

In the first step I used the “Rdata” object (count matrix) to create a seurat object. The object serves as a container that contains both data (count matrix) and analysis (PCA, or clustering results) for a single-cell dataset. The name of project is LVHEART ~ Left Ventricle of Heart. After running CreateSeuratObject() function we see that this dataset contains with 3978 samples.

LVH = CreateSeuratObject(counts = sm, project = 'LVHeart', min.cells = 3, min.features = 200, names.delim = "_" )
## Warning: Non-unique features (rownames) present in the input matrix, making
## unique
## Warning: Feature names cannot have underscores ('_'), replacing with dashes
## ('-')

## Warning: Feature names cannot have underscores ('_'), replacing with dashes
## ('-')
LVH
## An object of class Seurat 
## 20495 features across 3978 samples within 1 assay 
## Active assay: RNA (20495 features, 0 variable features)
##  2 layers present: counts, data

Then I extract mitochondial genes from data:

mito_genes <- grep("^MT-|^mt-", rownames(LVH@assays$RNA@data), value = TRUE)

Then I tried to calculate mitochondrial QC matrices;

We can calculate mitochondrial QC metrics with the PercentageFeatureSet function, which calculates the percentage of counts originating from a set of features. I use the set of all genes starting with mt- as a set of mitochondrial genes:

LVH[["percent.mt"]] <- PercentageFeatureSet(LVH, pattern = "^mt-")
head(LVH@meta.data)

Now, with percent.mt, I visualize QC matrices, using VlnPlot which shows violon plot:

VlnPlot(LVH, features = c('nFeature_RNA', 'nCount_RNA', 'percent.mt'), ncol = 3, pt.size=0)

mean(LVH$nFeature_RNA)
## [1] 1260.213
max(LVH$nFeature_RNA)
## [1] 5940
min(LVH$nFeature_RNA)
## [1] 198

Moreover, with FeatureScatter() function, I compared two different features together:

plot1 = FeatureScatter(LVH, feature1 = "nCount_RNA", feature2 = "percent.mt")
plot2 = FeatureScatter(LVH, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")
plot1 + plot2

By playing with numbers and the best choose of samples, I tried to extract suitable number of cells:

summary(LVH$nFeature_RNA > 200)  
##    Mode   FALSE    TRUE 
## logical       9    3969
summary(LVH$nFeature_RNA < 3700)      
##    Mode   FALSE    TRUE 
## logical      42    3936
summary(LVH$percent.mt < 15 )            
##    Mode   FALSE    TRUE 
## logical      56    3922

I subset dataset with n_Feature_RNA higher than 200 and lower than 3700, and mitochondial percentage of less than 15 in each cell:

I found that 106 samples are omitted now.

LVH= subset(LVH, subset = nFeature_RNA < 3700 & nFeature_RNA > 200 & percent.mt < 15)
LVH
## An object of class Seurat 
## 20495 features across 3872 samples within 1 assay 
## Active assay: RNA (20495 features, 0 variable features)
##  2 layers present: counts, data

Normalizing the data

I normalize dataset using NormalizeData() function of Seurat package which is a global-scaling normalization method. (In this part I use ‘LogNormalize’ method and scale factor of 10000):

LVH = NormalizeData(LVH, normalization.method = 'LogNormalize', scale.factor = 10000)

Identification of highly variable features (feature selection)

To restrict the gene set to the “most variable” genes, I do this, which ‘vst’ method is used and 2000 variable features are obtained by default:

LVH = FindVariableFeatures(LVH,selection.method = 'vst')
## Warning: The following arguments are not used: nselect
#10 most highly variable features are saved in top10
top10=head(VariableFeatures(LVH), 10)
top10
##  [1] "Hba-a2" "Hbb-bt" "Lgals3" "Hba-a1" "Nrg1"   "Ccl8"   "Ccl2"   "Hbb-bs"
##  [9] "Igfbp5" "Ccl7"

Then, I plot variable features with labels(top10 features):

LabelPoints(plot = VariableFeaturePlot(LVH), points = top10, repel = T)
## When using repel, set xnudge and ynudge to 0 for optimal results

Scaling the data

Apply a linear transformation (‘scaling’) that is a standard pre-processing step prior to dimensional reduction techniques like PCA.

The ScaleData() function shifts the expression of each gene, so that the mean expression across cells is 0; furthermore, it scales the expression of each gene, so that the variance across cells is 1. (This step gives equal weight in downstream analyses, so that highly-expressed genes do not dominate)

  1. Scale all genes:

    This scaling step is mandatory in every processes

    all.genes= rownames(LVH)
    LVH= ScaleData(LVH, features = all.genes)
    ## Centering and scaling data matrix
  2. Scale to mt genes:

    we could use the ScaleData() function to remove unwanted sources of variation from a single-cell dataset.

    We do 2nd or 3rd steps if we find mitochondrial or cell cycle related genes as marker in our data.

    LVH = ScaleData(LVH, vars.to.regress = "percent.mt")
    ## Regressing out percent.mt
    ## Centering and scaling data matrix

A list of cell cycle markers, from Tirosh et al, 2015, is loaded with Seurat. We can segregate this list into markers of G2/M phase and markers of S phase, but at first because I use mm genes, I have to change gene names from capital to short letters except first letter:

cc.genes
## $s.genes
##  [1] "MCM5"     "PCNA"     "TYMS"     "FEN1"     "MCM2"     "MCM4"    
##  [7] "RRM1"     "UNG"      "GINS2"    "MCM6"     "CDCA7"    "DTL"     
## [13] "PRIM1"    "UHRF1"    "MLF1IP"   "HELLS"    "RFC2"     "RPA2"    
## [19] "NASP"     "RAD51AP1" "GMNN"     "WDR76"    "SLBP"     "CCNE2"   
## [25] "UBR7"     "POLD3"    "MSH2"     "ATAD2"    "RAD51"    "RRM2"    
## [31] "CDC45"    "CDC6"     "EXO1"     "TIPIN"    "DSCC1"    "BLM"     
## [37] "CASP8AP2" "USP1"     "CLSPN"    "POLA1"    "CHAF1B"   "BRIP1"   
## [43] "E2F8"    
## 
## $g2m.genes
##  [1] "HMGB2"   "CDK1"    "NUSAP1"  "UBE2C"   "BIRC5"   "TPX2"    "TOP2A"  
##  [8] "NDC80"   "CKS2"    "NUF2"    "CKS1B"   "MKI67"   "TMPO"    "CENPF"  
## [15] "TACC3"   "FAM64A"  "SMC4"    "CCNB2"   "CKAP2L"  "CKAP2"   "AURKB"  
## [22] "BUB1"    "KIF11"   "ANP32E"  "TUBB4B"  "GTSE1"   "KIF20B"  "HJURP"  
## [29] "CDCA3"   "HN1"     "CDC20"   "TTK"     "CDC25C"  "KIF2C"   "RANGAP1"
## [36] "NCAPD2"  "DLGAP5"  "CDCA2"   "CDCA8"   "ECT2"    "KIF23"   "HMMR"   
## [43] "AURKA"   "PSRC1"   "ANLN"    "LBR"     "CKAP5"   "CENPE"   "CTCF"   
## [50] "NEK2"    "G2E3"    "GAS2L3"  "CBX5"    "CENPA"
s.genes=stringr::str_to_title(cc.genes$s.genes)
s.genes
##  [1] "Mcm5"     "Pcna"     "Tyms"     "Fen1"     "Mcm2"     "Mcm4"    
##  [7] "Rrm1"     "Ung"      "Gins2"    "Mcm6"     "Cdca7"    "Dtl"     
## [13] "Prim1"    "Uhrf1"    "Mlf1ip"   "Hells"    "Rfc2"     "Rpa2"    
## [19] "Nasp"     "Rad51ap1" "Gmnn"     "Wdr76"    "Slbp"     "Ccne2"   
## [25] "Ubr7"     "Pold3"    "Msh2"     "Atad2"    "Rad51"    "Rrm2"    
## [31] "Cdc45"    "Cdc6"     "Exo1"     "Tipin"    "Dscc1"    "Blm"     
## [37] "Casp8ap2" "Usp1"     "Clspn"    "Pola1"    "Chaf1b"   "Brip1"   
## [43] "E2f8"
g2m.genes=stringr::str_to_title(cc.genes$g2m.genes)
g2m.genes
##  [1] "Hmgb2"   "Cdk1"    "Nusap1"  "Ube2c"   "Birc5"   "Tpx2"    "Top2a"  
##  [8] "Ndc80"   "Cks2"    "Nuf2"    "Cks1b"   "Mki67"   "Tmpo"    "Cenpf"  
## [15] "Tacc3"   "Fam64a"  "Smc4"    "Ccnb2"   "Ckap2l"  "Ckap2"   "Aurkb"  
## [22] "Bub1"    "Kif11"   "Anp32e"  "Tubb4b"  "Gtse1"   "Kif20b"  "Hjurp"  
## [29] "Cdca3"   "Hn1"     "Cdc20"   "Ttk"     "Cdc25c"  "Kif2c"   "Rangap1"
## [36] "Ncapd2"  "Dlgap5"  "Cdca2"   "Cdca8"   "Ect2"    "Kif23"   "Hmmr"   
## [43] "Aurka"   "Psrc1"   "Anln"    "Lbr"     "Ckap5"   "Cenpe"   "Ctcf"   
## [50] "Nek2"    "G2e3"    "Gas2l3"  "Cbx5"    "Cenpa"

Score cell cycle phases using CellCycleScoring() function

LVH=CellCycleScoring(LVH,s.features = s.genes,g2m.features = g2m.genes,set.ident = T)
## Warning: The following features are not present in the object: Fen1, Ung,
## Uhrf1, Mlf1ip, Ubr7, Usp1, not searching for symbol synonyms
## Warning: The following features are not present in the object: Ube2c, not
## searching for symbol synonyms
head(LVH[[]])
  1. Scale to cc genes:

    #LVH = ScaleData(LVH, vars.to.regress = c("S.Score", "G2M.Score"), features = rownames(LVH))

Perform linear dimensional reduction

First of all I have to check that mt genes or cell cycle genes are not in my data;

* By doing this plot, I found that no mt or cc genes are in my data. Because when we run a PCA on cell cycle genes, it reveals tha cells do not separate entirely by phase. It approved that my dataset does not need any regression out of mitochondrial or cell cycle genes

LVH= RunPCA(LVH,features = c(s.genes,g2m.genes))
## Warning in PrepDR(object = object, features = features, verbose = verbose): The
## following 48 features requested have not been scaled (running reduction without
## them): Mcm5, Fen1, Mcm2, Mcm4, Rrm1, Ung, Gins2, Prim1, Uhrf1, Mlf1ip, Rfc2,
## Rpa2, Nasp, Rad51ap1, Wdr76, Slbp, Ccne2, Ubr7, Pold3, Msh2, Cdc45, Cdc6, Exo1,
## Tipin, Dscc1, Blm, Casp8ap2, Usp1, Pola1, Chaf1b, Brip1, Ube2c, Ndc80, Nuf2,
## Tmpo, Anp32e, Tubb4b, Hn1, Cdc25c, Kif2c, Rangap1, Ncapd2, Dlgap5, Lbr, Ckap5,
## Ctcf, G2e3, Cbx5
## Warning in irlba(A = t(x = object), nv = npcs, ...): You're computing too large
## a percentage of total singular values, use a standard svd instead.
## Warning in irlba(A = t(x = object), nv = npcs, ...): did not converge--results
## might be invalid!; try increasing work or maxit
## Warning: Requested number is larger than the number of available items (49).
## Setting to 49.

## Warning: Requested number is larger than the number of available items (49).
## Setting to 49.

## Warning: Requested number is larger than the number of available items (49).
## Setting to 49.

## Warning: Requested number is larger than the number of available items (49).
## Setting to 49.

## Warning: Requested number is larger than the number of available items (49).
## Setting to 49.
## PC_ 1 
## Positive:  Psrc1, Pcna, Mcm6, Cdca7, Clspn, E2f8, Dtl, Bub1, Tyms, Hells 
##     Atad2, Ttk, Gmnn, Gtse1, Tacc3, Gas2l3, Smc4, Kif11, Nek2, Rad51 
##     Hjurp, Aurkb, Ckap2l, Ect2 
## Negative:  Mki67, Tpx2, Top2a, Cenpf, Birc5, Fam64a, Cks2, Cdca8, Kif20b, Anln 
##     Hmmr, Cenpe, Cenpa, Kif23, Ccnb2, Cdca3, Cdc20, Hmgb2, Nusap1, Ckap2 
##     Cks1b, Aurka, Rrm2, Cdk1 
## PC_ 2 
## Positive:  Cenpa, Cdc20, Cenpf, Ect2, Fam64a, Tpx2, Nek2, Cks2, Ccnb2, Cenpe 
##     Kif20b, Hmmr, Aurka, Birc5, Ckap2l, Anln, Gas2l3, Cdca3, Psrc1, Ckap2 
##     Mki67, Gtse1, Cdca8, Kif11 
## Negative:  Hells, Atad2, Dtl, Mcm6, Gmnn, Pcna, Rad51, Cdca7, Clspn, Tyms 
##     Rrm2, Top2a, E2f8, Cks1b, Hmgb2, Hjurp, Smc4, Aurkb, Cdk1, Ttk 
##     Cdca2, Bub1, Nusap1, Kif23 
## PC_ 3 
## Positive:  Cenpa, Cks2, Birc5, Hmgb2, Gmnn, Ckap2l, Atad2, Cks1b, Hells, Tyms 
##     Cdc20, Ccnb2, Dtl, Mcm6, Anln, Kif11, E2f8, Nek2, Cenpf, Clspn 
##     Gtse1, Tpx2, Aurka, Mki67 
## Negative:  Nusap1, Cdca2, Gas2l3, Top2a, Hjurp, Cdk1, Ttk, Tacc3, Rad51, Psrc1 
##     Kif20b, Pcna, Cenpe, Aurkb, Rrm2, Cdca3, Hmmr, Ckap2, Cdca7, Cdca8 
##     Smc4, Ect2, Kif23, Fam64a 
## PC_ 4 
## Positive:  Gas2l3, Gmnn, Pcna, Cdca7, Cks1b, Birc5, Hells, Cdca3, Cenpa, Ccnb2 
##     Tacc3, Hmgb2, Mcm6, Fam64a, Kif20b, Cdc20, Kif23, Rad51, Dtl, Clspn 
##     Ttk, Tpx2, Nusap1, Aurka 
## Negative:  Smc4, Tyms, Atad2, Psrc1, Ckap2, Mki67, Hjurp, Cenpe, Cks2, Ckap2l 
##     Top2a, Rrm2, Cdk1, Ect2, Cdca2, Bub1, Cdca8, Anln, Aurkb, Nek2 
##     Hmmr, Cenpf, Kif11, Gtse1 
## PC_ 5 
## Positive:  Cks1b, Tacc3, Top2a, Nusap1, Cdca7, Cdca8, Birc5, Cenpe, Rrm2, Psrc1 
##     Dtl, Rad51, Clspn, Atad2, Gmnn, Hjurp, Mcm6, Cdca2, Fam64a, Aurkb 
##     Ect2, Cdk1, Hells, E2f8 
## Negative:  Pcna, Tyms, Gas2l3, Anln, Cks2, Kif11, Kif20b, Cdc20, Ccnb2, Hmgb2 
##     Ckap2, Mki67, Nek2, Cenpa, Kif23, Ttk, Aurka, Bub1, Gtse1, Tpx2 
##     Cdca3, Hmmr, Cenpf, Ckap2l
DimPlot(LVH)

perform PCA on the scaled data:

By default, only the previously determined variable features are used as input, but we can use “features” argument if we wish to choose a different subset.

LVH= RunPCA(LVH,features = VariableFeatures(object = LVH))
## PC_ 1 
## Positive:  Fcer1g, Tyrobp, Ctss, C1qb, C1qc, Fcgr3, Alox5ap, C1qa, Ms4a6c, Laptm5 
##     Ctsc, Lyz2, Cd14, Csf1r, Trem2, Spi1, Cd74, Cd68, C5ar1, Wfdc17 
##     Pld4, Ccl6, Mafb, Ms4a7, H2-Aa, F13a1, Ms4a6b, Coro1a, Cybb, Aif1 
## Negative:  Sparc, Fstl1, Bgn, Col1a2, Dcn, Col3a1, Col1a1, Serpinh1, Ccdc80, Mfap5 
##     Mgp, Col8a1, Fbln2, Lhfp, Postn, Cygb, Lum, Aspn, Pcolce, Ctgf 
##     Serping1, Col5a2, Loxl1, Col6a1, Pam, Fbn1, Sdc2, Pi16, Sfrp1, Nfix 
## PC_ 2 
## Positive:  Fabp4, Egfl7, Cd36, Rbp7, Gpihbp1, Flt1, Cdh5, Cav1, Rgcc, Tspan13 
##     Ctla2a, Sdpr, Adgrf5, Pecam1, Kdr, Kitl, Tcf15, Mgll, Podxl, Ptprb 
##     Cd93, Lims2, RP24-171J16.5, Slc9a3r2, RP23-310J6.1, Clec1a, Ly6c1, S1pr1, Cd300lg, Esam 
## Negative:  Col1a2, Col3a1, Col1a1, Dcn, Col8a1, Gsn, Bgn, Lgals1, Mfap5, Ccdc80 
##     Pmp22, Gpx3, Lum, Postn, Fstl1, Loxl1, Col5a2, Col6a1, Htra3, Ogn 
##     Sdc2, Serping1, Adamts2, Vcan, Sfrp1, Gas7, Aspn, Smoc2, Mgp, Abca8a 
## PC_ 3 
## Positive:  Myl3, Myl2, Fabp3, Tnnc1, Tnnt2, Tnni3, Csrp3, Ankrd1, Actc1, Chchd10 
##     Acta1, Cox8b, Hspb6, Myoz2, Hspb7, Cox6a2, Hba-a1, Pln, Smpx, Nmrk2 
##     Cox7a1, Pgam2, Hbb-bt, Tuba4a, Des, Hbb-bs, Nppb, Ckmt2, Tcap, Hba-a2 
## Negative:  Flt1, Arhgap31, Eng, Klf4, Jun, Cav1, Egfl7, Ptprb, Cdh5, Sdpr 
##     Rdx, Timp3, Slc9a3r2, Pecam1, Slfn5, Adgrf5, Slc6a6, Heg1, RP23-81C12.1, Cd93 
##     Emcn, Tubb5, Tm4sf1, Cd36, Id1, Ly6a, S1pr1, Ece1, Ctla2a, Ly6c1 
## PC_ 4 
## Positive:  Npr3, Nrg1, Cpe, Smoc1, Gpm6a, Bex4, Cgnl1, Plvap, Vwf, Plagl1 
##     Col23a1, Sost, Bmx, Mal, Enpp6, Edn1, Tmem2, Hmcn1, Tmem108, Bmp6 
##     Epha7, Spint2, Kank1, Dpp4, Foxc1, Bace2, Pde3a, Chrm2, Crim1, Ahsg 
## Negative:  Ly6c1, Sparcl1, Ly6a, Rgcc, Gpihbp1, Rbp7, RP24-171J16.5, Tcf15, Cd36, Gdpd3 
##     Cxcl12, Mgll, Lpl, Btnl9, RP24-293E17.2, RP23-310J6.1, Cd300lg, Fabp4, Fam101b, RP23-52N2.1 
##     Kitl, Sparc, C1qtnf9, Id1, Sox17, Aplnr, Lims2, Slc26a10, Kcna5, Nav3 
## PC_ 5 
## Positive:  Mki67, Tpx2, Top2a, Pclaf, Birc5, Cenpf, Casc5, Ccna2, Fam64a, Prc1 
##     Cks2, Cenpm, Cdca8, Knstrn, Kif20b, Hmmr, Hist1h2ao, Anln, Smc2, Cenpe 
##     Spc25, Ccnb2, Kif23, Cenpa, Cdca3, Diaph3, Ccnb1, Hist1h2ak, Hmgb2, Lockd 
## Negative:  Cfh, Gas6, Ogn, Wisp2, Htra3, Crispld2, Apoe, Cebpd, Fcgrt, Dhrs3 
##     Slfn5, Ahnak2, Cyp1b1, Gxylt2, Cygb, Scn7a, Abca8a, Mmp2, Nupr1, Pi16 
##     Stc2, Ctsb, Ctsd, Frzb, Adamtsl2, Sfrp1, F13a1, Ccdc80, Igf1, Clec3b
## Warning: Number of dimensions changing from 48 to 50

Seurat provides several useful ways of visualizing both cells and features that define the PCA, including VizDimReduction(), DimPlot(), and DimHeatmap()

Examine and visualize PCA results in different ways:

print(LVH[["pca"]], dims = 1:5, nfeatures = 5)
## PC_ 1 
## Positive:  Fcer1g, Tyrobp, Ctss, C1qb, C1qc 
## Negative:  Sparc, Fstl1, Bgn, Col1a2, Dcn 
## PC_ 2 
## Positive:  Fabp4, Egfl7, Cd36, Rbp7, Gpihbp1 
## Negative:  Col1a2, Col3a1, Col1a1, Dcn, Col8a1 
## PC_ 3 
## Positive:  Myl3, Myl2, Fabp3, Tnnc1, Tnnt2 
## Negative:  Flt1, Arhgap31, Eng, Klf4, Jun 
## PC_ 4 
## Positive:  Npr3, Nrg1, Cpe, Smoc1, Gpm6a 
## Negative:  Ly6c1, Sparcl1, Ly6a, Rgcc, Gpihbp1 
## PC_ 5 
## Positive:  Mki67, Tpx2, Top2a, Pclaf, Birc5 
## Negative:  Cfh, Gas6, Ogn, Wisp2, Htra3
#plot data on first and second PCA dimensions:
VizDimLoadings(LVH,dims = 1:2, reduction = 'pca')

#Also we can use the 'reduction' argument showing what dimensional reduction we use:
DimPlot(LVH, reduction = 'pca')

#We can plot data projected on any of the PCA dimensions:
DimPlot(LVH, reduction = "pca", dims = c(3,4))

DimHeatmap() allows for easy exploration of the primary sources of heterogeneity in a dataset, and can be useful when trying to decide which PCs to include for further downstream analyses:

#It shows only one dimention:
DimHeatmap(LVH, dims = 1, cells = 500, balanced = TRUE)

DimHeatmap(LVH, dims = 1:12, cells = 500, balanced = TRUE)

Determine the ‘dimensionality’ of the dataset:

‘Elbow plot’ is a heuristic method to rank principle components based on the percentage of variance explained by each one:

ElbowPlot(LVH)

Clustering the cells; Now I apply modularity optimization techniques such as the Louvain algorithm (default) to iteratively group cells together

The goal is optimizing the standard modularity function. The FindClusters() function implements this procedure, and contains a resolution parameter that sets the ‘granularity’ of the downstream clustering. I choose complete dimentions then I set this parameter 0.7 which returns good results for single-cell datasets.

* Optimal resolution often increases for larger datasets.

LVH=FindNeighbors(LVH, dims = 1:20)
## Computing nearest neighbor graph
## Computing SNN
LVH=FindClusters(LVH,resolution = 0.5)
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 3872
## Number of edges: 131346
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9135
## Number of communities: 17
## Elapsed time: 0 seconds
#Look at cluster IDs of the first 5 cells:
head(Idents(LVH), 5)
## AAACCTGAGGGATACC AAACCTGCAGTTAACC AAACCTGGTAACGACG AAACCTGGTATATGAG 
##                5                3                0                1 
## AAACCTGGTCGGCACT 
##                1 
## Levels: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

Perform non-linear dimensional reduction (UMAP/tSNE):

The goal of non-linear dimensional reduction is to learn the underlying manifold of the data in order to place similar cells together in low-dimensional space. Cells within the graph-based clusters determined above should co-localize on these dimension reduction plots.

In this regard I used the same PCs as input to the clustering analysis.

* reticulate performs us to do UMAP process.

library(reticulate)
LVH=RunUMAP(LVH, dims = 1:20)
## Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
## To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
## This message will be shown once per session
## 13:07:21 UMAP embedding parameters a = 0.9922 b = 1.112
## 13:07:21 Read 3872 rows and found 20 numeric columns
## 13:07:21 Using Annoy for neighbor search, n_neighbors = 30
## 13:07:21 Building Annoy index with metric = cosine, n_trees = 50
## 0%   10   20   30   40   50   60   70   80   90   100%
## [----|----|----|----|----|----|----|----|----|----|
## **************************************************|
## 13:07:22 Writing NN index file to temp file /var/folders/xh/05rxt7y54b1bzmb_t5z5wg080000gn/T//Rtmpxc7ZqC/file29ad54f41132
## 13:07:22 Searching Annoy index using 1 thread, search_k = 3000
## 13:07:22 Annoy recall = 100%
## 13:07:22 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
## 13:07:23 Initializing from normalized Laplacian + noise (using RSpectra)
## 13:07:23 Commencing optimization for 500 epochs, with 160554 positive edges
## 13:07:27 Optimization finished

individual clusters

DimPlot(LVH, reduction = "umap")

table(Idents(LVH))
## 
##   0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16 
## 768 625 581 430 301 239 174 161 155 116  81  55  49  48  39  28  22
#save the object:
saveRDS(LVH, file = 'LVH.rds')

The final step is to give an “identity” to the clusters

Finding differentially expressed features (cluster biomarkers):

#As an example in cluster 1:
cluster1.markers = FindMarkers(LVH, ident.1 = 1, min.pct = 0.25)
head(cluster1.markers, n = 5)
# find all markers of cluster 2:
cluster2.markers <- FindMarkers(LVH, ident.1 = 2, min.pct = 0.25)
head(cluster2.markers, n = 5)
# find all markers distinguishing cluster 5 from clusters 0 and 3
cluster5.markers <- FindMarkers(LVH, ident.1 = 5, ident.2 = c(0, 3), min.pct = 0.25)
head(cluster5.markers, n = 5)

Now, finding the markers of ‘all clusters’:

In is case, 25% of the cells are chosen with a log2 fold change = 0.25 as threshold.

LVH.markers = FindAllMarkers(LVH, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)
## Calculating cluster Fibroblasts
## Calculating cluster Endothelial cells
## Calculating cluster Macrophages
## Calculating cluster Erythroid-like and erythroid precursor cells
## Calculating cluster Unknown
## Calculating cluster Pericytes
## Calculating cluster T memory cells
## Calculating cluster Cardiomyocytes
#Top two markers of each cluster:
Markers_top2=LVH.markers %>% group_by(cluster) %>% top_n(n = 2, wt = avg_log2FC)
Markers_top2

Seurat includes several tools for visualizing marker expression.

VlnPlot (shows expression probability distributions across clusters), and

FeaturePlot (visualizes feature expression on a tSNE or PCA plot) are our most commonly used visualizations.

We also suggest exploring RidgePlot, CellScatter, and DotPlot as additional methods to view your dataset.

Example: “Gsn”, “C2”, “Csf1r”

#As we can see in violin plot, "Gsn" could not be a marker, because, it is highly expressed in different clusters,
#Furthermore, "C2" is highly expressed in Fibroblast cell but in this analysis, we can see that this gene is expressed in B cells.
VlnPlot(LVH, features = c("Gsn", "C2", "Csf1r"), pt.size=0) 

#the typical plot in articles:
FeaturePlot(LVH, features = c("Gsn", "C2", "Csf1r"))

#The size of the dot: corresponds to the percentage of cells expressing the feature in each cluster. 
#The color: represents the average expression level
DotPlot(LVH, features = c("Gsn", "Itga1", "Acta1", "Col1a2", "Fabp4", "Itgam", "Lgals3", "Itga6", "Tnxb", "Ckap4", "Ccl8", "Mmp2") ) + RotatedAxis()

I tried to find markers manually from two data that I obtaied :

  1. Markers_top2

  2. HeartmmMarker

    * Moreover I got help from Cell type gene expression markers of PanglaoDB

    * I could not find any marker gene for “Erythroid-like and erythroid precursor cells”

    * Also, I could not find the type of “Unknown” cell in the cluster

    Mrkr_LVH = c("Col8a1", "Rgcc", "Tnnc1", "Nrg1", "Ttn", "Higd1b",'Gpihbp1', 'Ccr5', 'Kit', "Vcam1", "Myl7", "Mmrn1", "Hand2", "Lyve1", "Itgam", "Mmp2", "Gja1", "Ednrb", "Actc1", "Alcam")
#Fibroblast, Endothelial, T cell
VlnPlot(LVH, features = c("Col8a1", "Rgcc","Tnnc1"), pt.size=0) 

#Unknown, Unknown, Pericyt
VlnPlot(LVH, features = c("Nrg1","Ttn", "Higd1b"), pt.size=0)

#Unknown, B cell, Mesothelial
VlnPlot(LVH, features = c("Vcam1", "Myl7", "Mmrn1"), pt.size=0)

#Unknown, Mesothelial, Cardiomyocyte
VlnPlot(LVH, features = c( "Hand2", "Lyve1", "Itgam"), pt.size=0)

#Fibroblast, B cell, Pericyte
VlnPlot(LVH, features = c( "Mmp2","Gja1", "Ednrb"), pt.size=0)

#T memmory cell, Cardiomyocyte, Macrophage
VlnPlot(LVH, features = c( "Actc1", "Alcam", "Syk"), pt.size=0)

#Endothelial, Macrophage, Erythroid-like and erythroid precursor cells
VlnPlot(LVH, features = c('Gpihbp1', 'Ccr5', 'Kit'), pt.size=0, slot = "counts", log = T) 
## Warning: The `slot` argument of `VlnPlot()` is deprecated as of Seurat 5.0.0.
## ℹ Please use the `layer` argument instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

DotPlot(LVH, features = Mrkr_LVH ) + RotatedAxis()

DoHeatmap(LVH, features = Mrkr_LVH) + NoLegend()
## Warning in DoHeatmap(LVH, features = Mrkr_LVH): The following features were
## omitted as they were not found in the scale.data slot for the RNA assay: Kit

Generates an expression heatmap for given cells and features. In this case, first we are plotting the top marker for each cluster. then top 10 markers in each cluster:

top1 = LVH.markers %>% group_by(cluster) %>% top_n(n = 1, wt = avg_log2FC)
top1
DoHeatmap(LVH, features = top1$gene) + NoLegend()

top10 = LVH.markers %>% group_by(cluster) %>% top_n(n = 10, wt = avg_log2FC)
DoHeatmap(LVH, features = top10$gene) + NoLegend()
## Warning in DoHeatmap(LVH, features = top10$gene): The following features were
## omitted as they were not found in the scale.data slot for the RNA assay: Ebf1,
## Ltbp4

Now, we have the markers and the cell types associated with each cluster were already known.

DimPlot(LVH, reduction = "umap", label = TRUE, pt.size = 0.5) + NoLegend()

Additional:

I tried to find the type of cluster of each marker without knowing cell types:

#I go back before do clustering:
#saveRDS(LVH, file = 'LVH.rds')
#LVH.markers = FindAllMarkers(LVH, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)

#Mrkr_LVH = c("Col8a1", "Rgcc", "Tnnc1", "Nrg1", "Ttn", "Higd1b", "Vcam1", "Myl7", "Mmrn1", "Hand2", "Lyve1", "Itgam", "Mmp2", "Gja1", "Ednrb", "Actc1", "Alcam")

#VlnPlot(LVH, features = c("Col8a1", "Rgcc","Tnnc1"), pt.size=0) 
#Fibroblast, Endothelial, T cell

#VlnPlot(LVH, features = c("Nrg1","Ttn", "Higd1b"), pt.size=0)
#Unknown, Unknown, Pericyt

#VlnPlot(LVH, features = c("Vcam1", "Myl7", "Mmrn1"), pt.size=0)
#Unknown, B cell, Mesothelial

#VlnPlot(LVH, features = c( "Hand2", "Lyve1", "Itgam"), pt.size=0)       #Lyve1   Itgam
#Unknown, Mesothelial, Cardiomyocyte

#VlnPlot(LVH, features = c( "Mmp2","Gja1", "Ednrb"), pt.size=0)
#Fibroblast, B cell, Pericyte

#VlnPlot(LVH, features = c( "Actc1", "Alcam", "Syk"), pt.size=0)
#T memmory cell, Cardiomyocyte, Macrophage

#DotPlot(LVH, features = Mrkr_LVH ) + RotatedAxis()
#DoHeatmap(LVH, features = Mrkr_LVH) + NoLegend()

#cluster3.markers <- FindMarkers(LVH, ident.1 = 3, min.pct = 0.25)
#head(cluster3.markers, n = 10)

#cluster4.markers <- FindMarkers(LVH, ident.1 = 4, min.pct = 0.25)
#head(cluster4.markers, n = 10)

#cluster8.markers <- FindMarkers(LVH, ident.1 = 8, min.pct = 0.25)
#head(cluster8.markers, n = 10)

#cluster13.markers <- FindMarkers(LVH, ident.1 = 13, min.pct = 0.25)
#head(cluster13.markers, n = 10)

#cluster14.markers <- FindMarkers(LVH, ident.1 = 14, min.pct = 0.25)
#head(cluster14.markers, n = 10)

Materials and tools that have been used in this project are available here:

sessionInfo()
## R version 4.2.3 (2023-03-15)
## Platform: aarch64-apple-darwin20 (64-bit)
## Running under: macOS Ventura 13.3.1
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] reticulate_1.30         patchwork_1.1.2         dplyr_1.1.2            
## [4] Seurat_4.9.9.9049       SeuratObject_4.9.9.9085 sp_1.6-1               
## 
## loaded via a namespace (and not attached):
##   [1] ggbeeswarm_0.7.2       Rtsne_0.16             colorspace_2.1-0      
##   [4] deldir_1.0-9           ellipsis_0.3.2         ggridges_0.5.4        
##   [7] RcppHNSW_0.4.1         spatstat.data_3.0-1    rstudioapi_0.14       
##  [10] farver_2.1.1           leiden_0.4.3           listenv_0.9.0         
##  [13] ggrepel_0.9.3          RSpectra_0.16-1        fansi_1.0.4           
##  [16] codetools_0.2-19       splines_4.2.3          cachem_1.0.8          
##  [19] knitr_1.43             polyclip_1.10-4        spam_2.9-1            
##  [22] jsonlite_1.8.5         ica_1.0-3              cluster_2.1.4         
##  [25] png_0.1-8              uwot_0.1.14            spatstat.sparse_3.0-1 
##  [28] shiny_1.7.4            sctransform_0.3.5      compiler_4.2.3        
##  [31] httr_1.4.6             Matrix_1.5-4.1         fastmap_1.1.1         
##  [34] lazyeval_0.2.2         limma_3.54.2           cli_3.6.1             
##  [37] later_1.3.1            htmltools_0.5.5        tools_4.2.3           
##  [40] igraph_1.4.3           dotCall64_1.0-2        gtable_0.3.3          
##  [43] glue_1.6.2             RANN_2.6.1             reshape2_1.4.4        
##  [46] Rcpp_1.0.10            scattermore_1.2        jquerylib_0.1.4       
##  [49] presto_1.0.0           vctrs_0.6.3            nlme_3.1-162          
##  [52] spatstat.explore_3.2-1 progressr_0.13.0       lmtest_0.9-40         
##  [55] spatstat.random_3.1-5  xfun_0.39              stringr_1.5.0         
##  [58] globals_0.16.2         mime_0.12              miniUI_0.1.1.1        
##  [61] lifecycle_1.0.3        irlba_2.3.5.1          goftest_1.2-3         
##  [64] future_1.32.0          MASS_7.3-60            zoo_1.8-12            
##  [67] scales_1.2.1           promises_1.2.0.1       spatstat.utils_3.0-3  
##  [70] parallel_4.2.3         RColorBrewer_1.1-3     yaml_2.3.7            
##  [73] pbapply_1.7-0          gridExtra_2.3          ggrastr_1.0.2         
##  [76] ggplot2_3.4.2          sass_0.4.6             stringi_1.7.12        
##  [79] highr_0.10             fastDummies_1.6.3      rlang_1.1.1           
##  [82] pkgconfig_2.0.3        matrixStats_1.0.0      evaluate_0.21         
##  [85] lattice_0.21-8         tensor_1.5             ROCR_1.0-11           
##  [88] purrr_1.0.1            labeling_0.4.2         htmlwidgets_1.6.2     
##  [91] cowplot_1.1.1          tidyselect_1.2.0       parallelly_1.36.0     
##  [94] RcppAnnoy_0.0.20       plyr_1.8.8             magrittr_2.0.3        
##  [97] R6_2.5.1               generics_0.1.3         withr_2.5.0           
## [100] pillar_1.9.0           fitdistrplus_1.1-11    abind_1.4-5           
## [103] survival_3.5-5         tibble_3.2.1           future.apply_1.11.0   
## [106] crayon_1.5.2           KernSmooth_2.23-21     utf8_1.2.3            
## [109] spatstat.geom_3.2-1    plotly_4.10.2          rmarkdown_2.22        
## [112] grid_4.2.3             data.table_1.14.8      digest_0.6.31         
## [115] xtable_1.8-4           tidyr_1.3.0            httpuv_1.6.11         
## [118] munsell_0.5.0          beeswarm_0.4.0         viridisLite_0.4.2     
## [121] vipor_0.4.5            bslib_0.5.0

References:

  1. Seurat V4;

    https://satijalab.org/seurat/

  2. Giulio Pavesi; Seurat example on 10x Data (2021 updated);

    http://159.149.160.56/Transcriptomics/seurat.html

  3. CellMarker: a manually curated resource of cell markers in human and mouse

    DOI: 10.1093/nar/gky900

    http://biocc.hrbmu.edu.cn/CellMarker/

  4. Tabula Muris

    https://tabula-muris.ds.czbiohub.org/

  5. PanglaoDB

    https://panglaodb.se/index.html